Bi-directional ring-bus architecture for CORDIC-based matrix inversion

ABSTRACT

A method and a system are provided for Coordinate Rotation Digital Computer (CORDIC) based matrix inversion of input digital signal streams from multiple antennas using a bi-directional ring-bus architecture. The bi-directional ring bus includes a first ring bus having signals flow in a clockwise direction, and a second ring bus having signals flow in a counter-clockwise direction. An I/O controller is coupled to the first and the second ring bus, respectively. The architecture also includes a plurality of processing elements (PEs), each of which is coupled to the first and the second ring bus, respectively, and includes at least one CORDIC core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.

FIELD OF THE TECHNOLOGY

The present application relates to the field of processing and equalization of signals received from multiple antennas, using QR decomposition (QRD) based matrix inversion algorithms. More particularly, the present application provides a method and system which implement a bi-directional ring-bus architecture using Coordinate Rotation Digital Computer (CORDIC) cores to perform the complex task of QRD based matrix inversion of multiple input digital signal streams in Long Term Evolution (LTE) applications.

BACKGROUND

Multiple input/multiple output (MIMO) antenna deployment has become one of the most important techniques for enhancing system capacity and improving receiver performance in fourth generation (4G) high-speed wireless communication, under the Long Term Evolution (LTE) standard and the IEEE 802.16 WiMAX standard.

Signals received by each of the antenna ports are converted to a digital signal stream. Each digital signal stream is synchronized, has its carrier frequency offset removed, is digitally and automatically gain controlled, scaled, and associated with an estimated channel coefficient (H). The digital signal streams from the multiple antennas are then equalized through a complex-valued mathematical operation, namely, a matrix inversion.

The complexity of the matrix inversion is related to the MIMO antenna's dimension (i.e., the number of antennas used). An increase in the MIMO antenna's dimension improves system performance by increasing data throughput, but places a greater burden on the algorithm that performs the matrix inversion. The matrix inversion may be performed through software-oriented or hardware-oriented matrix inversion algorithms, each of which has its advantages and shortcomings.

Software-oriented matrix inversion algorithms, such as the Gaussian, Cofactor and Blockwise algorithms, may be implemented via programmed instructions on a digital signal processor (DSP). Software-oriented matrix inversion algorithms are flexible due to the programmable nature of the algorithms. These algorithms are able to achieve a high degree of precision. However, the software algorithms are not suitable for a large antenna matrix size, because significant computing power is required.

A hardware-oriented matrix inversion algorithm approach uses hard-coded pipeline processing elements (PEs) to perform complex-valued matrix inversions. One well-known hardware-oriented matrix inversion application utilizes a systolic array to perform matrix inversion on the multiple input digital signal streams. It is well known in the art that a systolic array processes signals in both the horizontal direction (West to East) and in the vertical direction (North to South) simultaneously.

FIGS. 1A and 1B illustrate an application of a systolic array (150) performing a 4×4 matrix inversion on LTE digital signal streams (132 a to 132 d). More specifically, FIG. 1A illustrates that the systolic array (150) may carry out a known QR decomposition (QRD) based matrix inversion algorithm in two stages, namely, QR decomposition operations in the first triangular processing block (120), and back-substitution (BS) and back-substitution delay (BSD) operations in the second triangular processing block (122). A total of 16 processing elements (PEs) and four delay units (−1A to −1D) (see FIG. 1A) may be employed in both processing blocks (120, 122) of the systolic array (150). The PEs for the QRD may be designated as QR-PE. The PEs for the back-substitution may be designated as BS-PE, and the PEs for the back-substitution delay may be designated as BSD-PE. Each of the processing elements (PEs) typically implements four CORDIC cores (not shown), arranged as two series-cascaded CORDIC cores in parallel arrangement. The structure and function of a CORDIC core is well known in the art.
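The two stages named above (QR decomposition followed by back-substitution) can be illustrated numerically. The following is only a minimal software sketch under stated assumptions: it uses real-valued matrices and floating-point Givens rotations rather than the complex-valued, CORDIC-based PEs of the systolic array, and the function names `givens_qr` and `invert` are hypothetical, chosen for illustration.

```python
import math

def givens_qr(a):
    """Return (q_t, r) such that q_t * a = r, with r upper triangular."""
    n = len(a)
    r = [row[:] for row in a]
    q_t = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(col + 1, n):
            h = math.hypot(r[col][col], r[row][col])
            if h == 0.0:
                continue  # nothing to annihilate
            c, s = r[col][col] / h, r[row][col] / h
            # Apply the same plane rotation to R and to the accumulated Q^T.
            for m in (r, q_t):
                for k in range(n):
                    top, bot = m[col][k], m[row][k]
                    m[col][k] = c * top + s * bot
                    m[row][k] = -s * top + c * bot
    return q_t, r

def invert(a):
    """Stage 1: A = QR. Stage 2: solve R * X = Q^T by back-substitution."""
    n = len(a)
    q_t, r = givens_qr(a)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):                 # one column of the inverse at a time
        for i in reversed(range(n)):   # bottom row first
            acc = q_t[i][j] - sum(r[i][k] * inv[k][j] for k in range(i + 1, n))
            inv[i][j] = acc / r[i][i]  # assumes A is non-singular
    return inv

a = [[4.0, 3.0], [6.0, 3.0]]
print(invert(a))
```

The sketch relies on the identity A⁻¹ = R⁻¹Qᵀ, so the inverse is obtained without ever forming R⁻¹ explicitly; a singular input would cause a division by zero and is not handled here.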

Referring to both FIGS. 1A and 1B, input digital signal streams (132 a to 132 d) are fed as input matrix A into the systolic array (150) one column at a time at a constant rate. After the 14th step, the first element of the inversed matrix A⁻¹ is output from the systolic array (150) as output streams (142 a to 142 d).

FIG. 1B illustrates that the input matrix (106) is deliberately skewed due to the fact that every PE of a systolic array (150) needs to be triggered by two inputs (i.e., from the West and from the North directions) simultaneously. In addition, a PE may function as a memory node for the parameters from the last matrix. Therefore, starting zeros (102) and ending zeros (104) are inserted into the matrix to ensure that every PE is properly reset between two consecutive column matrix inputs (106 a, 106 b). In addition, four delay units (−1A to −1D) (see FIG. 1A) are used in the systolic array (150) to synchronize the consecutive column matrix inputs (106 a, 106 b) to each other. Accordingly, the output data streams (142 a to 142 d) as inversed matrix (e.g., 116) from the systolic array (150) are also skewed by the same starting zeros (112) and ending zeros (114), respectively.
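The skewing of the input described above can be pictured with a short sketch. It assumes the conventional systolic staggering in which stream i is delayed by i steps and padded with zeros at both ends; the exact number and placement of starting and ending zeros in FIG. 1B are not reproduced here, and `skew_streams` is a hypothetical helper name.

```python
def skew_streams(matrix):
    """Delay stream i by i steps: i starting zeros, (n - 1 - i) ending zeros."""
    n = len(matrix)
    return [[0] * i + list(stream) + [0] * (n - 1 - i)
            for i, stream in enumerate(matrix)]

a = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

skewed = skew_streams(a)
for stream in skewed:
    print(stream)
```

Under this assumption a 4×4 input occupies 4 + 3 = 7 time steps instead of 4, which is the kind of padding overhead the text attributes to the starting and ending zeros.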

It may be pointed out that multi-user MIMO (MU-MIMO) requires matrix inversions to be performed in parallel. In this regard, performing 2×2 matrix inversions in parallel (not shown) for MU-MIMO may require four independent systolic arrays, which may contain a total of 4 QR-PEs, 4 BS-PEs, 8 BSD-PEs and 8 delay units (not shown). When configuring a 16-PE systolic array (150) for 4×4 matrix inversion (e.g., see FIG. 1A) to perform 2×2 matrix inversions, only two systolic arrays may be obtained from the 16-PE systolic array, thus wasting a large portion of the QR-PEs and BS-PEs. In addition, the latency and the power consumption of running all 64 CORDIC cores may not be reduced significantly when performing the 2×2 matrix inversion.

Therefore, even though the above systolic array (150) hardware matrix inversion algorithm approach has the advantages of high processing speed and high throughput, it nevertheless has at least the following disadvantages: (1) inflexible architecture, which cannot be scaled or configured to adapt to systems using different MIMO dimensions (i.e., having more or fewer antennas); (2) fixed precision, which cannot be configured to achieve a higher or lower precision based on system architecture or performance requirements; (3) high latency, since the high number of CORDIC cores used, the number of iterations required for the inversion algorithm, and the need to synchronize input matrix columns by inserting starting zeros (102) and ending zeros (104) to reset memories all cause processing delays, especially when the matrix size increases for large MIMO dimensions; (4) increased area and power consumption due to the high number of CORDIC cores in each PE, especially when the matrix size increases for large MIMO dimensions.

SUMMARY

The disclosure addresses the above disadvantages of both the software and hardware matrix inversion algorithm approaches by implementing a bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas.

In one aspect, a method for processing multiple antenna signals includes: receiving input signals from multiple antennas, converting the received input signals into a plurality of input digital signal streams, and performing QR decomposition (QRD) based matrix inversion on the plurality of input digital signal streams via a bi-directional ring bus processing architecture. The bi-directional ring bus processing architecture includes a bi-directional ring bus, with a first ring bus having signals flow in a clockwise direction and a second ring bus having signals flow in a counter-clockwise direction. An input/output (I/O) controller is coupled to the first and the second ring bus, respectively. The bi-directional ring bus processing architecture also includes a plurality of processing elements (PEs). Each of the plurality of PEs is coupled to the first and the second ring bus, respectively, and includes at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.

In a second aspect, an apparatus for processing multiple antenna signals includes: a receiver which receives input signals from multiple antennas; one or more processing circuits which process the received input signals into a plurality of input digital signal streams; and a bi-directional ring bus processing architecture which performs QR decomposition (QRD) based matrix inversion on the plurality of input digital signal streams, wherein the bi-directional ring bus processing architecture includes: a bi-directional ring bus, with a first ring bus having signals flow in a clockwise direction and a second ring bus having signals flow in a counter-clockwise direction; an input/output (I/O) controller coupled to the first and the second ring bus, respectively; and a plurality of processing elements (PEs), each of the plurality of PEs being coupled to the first and the second ring bus, respectively, wherein each of the plurality of PEs comprises at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.

In a third aspect, a matrix computation unit includes a bi-directional ring bus processing architecture which performs QR decomposition (QRD) based matrix inversion on a plurality of input digital signal streams, wherein the bi-directional ring bus processing architecture includes: a bi-directional ring bus, with a first ring bus having signals flow in a clockwise direction and a second ring bus having signals flow in a counter-clockwise direction; an input/output (I/O) controller coupled to the first and the second ring bus, respectively; and a plurality of processing elements (PEs), each of the plurality of PEs being coupled to the first and the second ring bus, respectively, wherein each of the plurality of PEs comprises at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are included to provide a further understanding of the claims, are incorporated in, and constitute a part of this specification. The detailed description and illustrated embodiments described serve to explain the principles defined by the claims.

FIG. 1A illustrates a systolic array in a related art, performing a 4×4 matrix inversion on LTE digital signal streams;

FIG. 1B illustrates the systolic array of FIG. 1A in a related art performing a 4×4 matrix inversion on LTE digital signal streams, with the column data of the input matrix being deliberately skewed prior to processing;

FIG. 2A is a flow chart, which illustrates exemplary steps of implementing a bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure;

FIG. 2B is an exemplary system block diagram illustrating an implementation of a bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure;

FIG. 3A illustrates a configuration of implementing a bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure. More specifically, FIG. 3A illustrates a configuration that performs four 2×2 matrix inversions in parallel to process input digital signal streams from four antenna ports;

FIG. 3B illustrates an exemplary input/output (I/O) controller coupled to the bi-directional ring-bus architecture of FIG. 3A, which initiates tokens to processing elements (PEs) to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure;

FIG. 3C illustrates an exemplary processing element (PE) coupled to the bi-directional ring-bus architecture of FIG. 3A to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure;

FIG. 3D illustrates implementing the bi-directional ring-bus architecture of FIG. 3A to perform matrix inversion of input digital signal streams from multiple antennas, without requiring the column data of the input matrix to be deliberately skewed prior to processing, according to an embodiment of the disclosure;

FIG. 3E illustrates using a multiple bi-directional ring-bus configuration to perform matrix inversion of input digital signal streams from an increased number of antennas, according to an embodiment of the disclosure; and

FIG. 3F illustrates using the same single bi-directional ring-bus architecture of FIG. 3A to perform matrix inversion of input digital signal streams from an increased number of antennas, according to an embodiment of the disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The problems described above are overcome by providing a method and system for processing digital signal streams using a bi-directional ring-bus processing architecture and using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas.

FIG. 2A is a flow chart, which illustrates exemplary operations for implementing a bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure. FIG. 2B is an exemplary system block diagram (200), which implements the bi-directional ring-bus architecture using CORDIC cores to perform matrix inversion of input digital signal streams from multiple antennas. For expediency, FIG. 2B will be referenced when describing the exemplary steps in FIG. 2A.

At block S1, MIMO signals (i.e., RF channels (222 a to 222 d)), which are received from respective multiple antenna ports (224 a to 224 d), are converted into digitized signals (228 a to 228 d). An RF receiver having an analog to digital converter (ADC) may facilitate such conversion. The digitized signals (228 a to 228 d) may further be processed by respective antenna processing units (230 a to 230 d) into a plurality of respective digital signal streams (232 a to 232 d), with respective channel coefficients (H) extracted and ready for equalization or matrix inversion calculation in a MIMO equalization processing block (234).

At block S2, the plurality of digital signal streams (232 a to 232 d) are sent to a bi-directional ring-bus architecture (250) for equalization, where matrix inversion operations on the received digital signal streams (232 a to 232 d) may take place. More specifically, the bi-directional ring-bus architecture includes an input/output (I/O) controller, a plurality of processing elements (PEs), and a ring-bus having a clockwise ring and a counter-clockwise ring (bi-directional ring-bus).

At block S3, the I/O controller (of the bi-directional ring-bus architecture) generates two respective initial tokens designated to a specific processing element (PE) within the bi-directional ring-bus architecture. Instructions are preloaded into a control block in each of the PEs within the bi-directional ring-bus architecture. The respective initial tokens carry embedded data from a respective digital signal stream, and are respectively dispatched to the clockwise ring as a clockwise ring token and to the counter-clockwise ring as a counter-clockwise ring token.

At block S4, the designated PE awaits the arrival of the designated clockwise ring token and the counter-clockwise ring token. The arrival of both the designated clockwise ring token and the counter-clockwise ring token may trigger the designated PE to start processing the embedded data according to the instructions preloaded into the PE. The processing may include at least executing algorithms to perform one of: QRD, back-substitution or back-substitution delay operations.

At block S5, after the designated PE completes the processing of the embedded data according to the instructions preloaded into the PE, the designated PE may generate and dispatch a new clockwise ring token and a new counter-clockwise ring token to a next designated PE to carry out a next processing step.

The designated PE may repeat the operations of blocks S4 and S5 each time a next clockwise ring token and a next counter-clockwise ring token arrive to trigger processing of the next data. Otherwise, the operations continue from block S6.

At block S6, if the instructions preloaded into the PE indicate that the pending processing by the designated PE is the final step, the designated PE may generate and transmit a terminating clockwise ring token and a terminating counter-clockwise ring token to the I/O controller. In other words, the matrix inversion operation on the received data may be complete by block S6.

At block S7, the I/O controller may output or transmit the completed data from the received terminating clockwise ring token and the terminating counter-clockwise ring token to be further processed by another processing block, such as a DSP, another processor, or another bi-directional ring-bus architecture for another round of matrix inversion (see FIG. 3E). Alternatively, the I/O controller may reuse the completed data and re-initiate a new clockwise ring token and a new counter-clockwise ring token to restart another round of matrix inversion (see FIG. 3F).
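The token flow of blocks S3 to S7 can be traced with a small software model. This is only a sketch under stated assumptions: address 0 stands in for the I/O controller, the per-PE operations are arbitrary placeholders rather than QRD, BS or BSD operations, and the names `run_ring`, `program` and `IO_CONTROLLER` are hypothetical.

```python
IO_CONTROLLER = 0  # assumed address standing in for the I/O controller

def run_ring(program, west, north, first_pe=1):
    """program maps a PE address to (operation, next address).

    operation(west, north) returns the data for the new clockwise and
    counter-clockwise tokens; next address 0 marks the final step.
    """
    # S3: the controller issues a token pair designated to the first PE.
    cw_addr = ccw_addr = first_pe
    cw_data, ccw_data = west, north
    visited = []
    while cw_addr != IO_CONTROLLER:
        # S4: the PE starts only when both tokens are designated to it.
        assert cw_addr == ccw_addr
        operation, next_addr = program[cw_addr]
        cw_data, ccw_data = operation(cw_data, ccw_data)
        visited.append(cw_addr)
        # S5 (next PE) or S6 (terminating tokens back to the controller).
        cw_addr = ccw_addr = next_addr
    # S7: the controller outputs the completed data.
    return (cw_data, ccw_data), visited

program = {
    1: (lambda w, n: (w + n, w - n), 2),
    2: (lambda w, n: (w * 2, n * 2), IO_CONTROLLER),
}
result, visited = run_ring(program, 3.0, 1.0)
print(result, visited)
```

The model is sequential for readability; in the described architecture many PEs may be processing different token pairs at the same time.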

FIG. 3A illustrates a configuration for implementing a bi-directional ring-bus architecture (350) using CORDIC cores to perform matrix inversion of input digital signal streams (332 a to 332 d) from multiple antennas, according to an embodiment of the disclosure. The bi-directional ring-bus architecture (350) may include an input/output (I/O) controller (352), a plurality of processing elements (such as PE1 to PE16), a clockwise ring (360) and a counter-clockwise ring (370). The I/O controller (352) and the plurality of PEs (PE1 to PE16) are each coupled to the clockwise ring (360) and the counter-clockwise ring (370), respectively.

One advantage of the ring-bus architecture is that it provides a wide range of scalability. For example, the 16-PE ring-bus architecture (350) illustrated in FIG. 3A may be configured to invert one 4×4 matrix (i.e., using matrix 1 to matrix 4 together), one 3×3 matrix (i.e., using any 9 PEs), four 2×2 matrixes in parallel (i.e., one in each of matrix 1 to matrix 4), or sixteen 1×1 matrixes in parallel (i.e., each using only a single PE).
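The configurations listed above follow from simple arithmetic if each n×n inversion is taken to occupy n×n PEs, as the 4×4 (16 PEs), 3×3 (9 PEs) and 1×1 (one PE) cases suggest. The helper name `parallel_inversions` below is hypothetical and serves only to tabulate that arithmetic.

```python
def parallel_inversions(total_pes, n):
    """Independent n x n inversions that fit, assuming n * n PEs per matrix."""
    return total_pes // (n * n)

TOTAL_PES = 16
for n in (4, 3, 2, 1):
    count = parallel_inversions(TOTAL_PES, n)
    idle = TOTAL_PES - count * n * n
    print(f"{n}x{n}: {count} in parallel, {idle} PEs idle")
```

For the 3×3 case the count is one inversion with seven PEs left idle, which matches the statement that any 9 of the 16 PEs may be used.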

As described later with reference to FIG. 3F, the same 16-PE ring-bus (350B) that is designed to invert one 4×4 matrix may be configured to invert one 8×8 matrix by configuring matrix 1 to matrix 4 within the ring-bus (350B) into two matrix groups (i.e., matrix 1-2 and matrix 3-4), and iteratively processing the two matrix groups four times. FIG. 3E illustrates that the 8×8 matrix inversion may alternatively be performed by cascading four 16-PE ring-buses (350-1 to 350-4), where each of the ring-buses (350-1 to 350-4) is configured as two matrix groups (i.e., matrix 1-2 and matrix 3-4) and processes two columns (a sub-layer) of the 8×8 matrix inputs.

In this regard, the ring-bus architecture (350) may be flexibly configured to support a wide range of antenna dimensions, such as an LTE 1×1 real reciprocal on every tone (100×12) in one transmission time interval (TTI), an LTE V 2×2 complex matrix inversion for each V-MIMO pair on every tone (80×12) in one TTI, LTE with 4×4 complex matrix inversion, and LTE-A with 8×8 complex matrix inversion, to name a few.

FIG. 3B illustrates an exemplary input/output (I/O) controller (352) coupled to the bi-directional ring-bus architecture of FIG. 3A, which initiates tokens to designated processing elements (PEs) to perform matrix inversion of input digital signal streams (332) from multiple antennas, according to an embodiment of the disclosure. The I/O controller (352) facilitates input of data stream signals (332 a to 332 d), and facilitates output of completed and processed signals (342 a to 342 d), to and from the clockwise ring (360) and the counter-clockwise ring (370). In addition, the I/O controller (352) includes an acknowledgment module (352-1), which receives acknowledgment signals (ACK) returned by the PEs in the ring-bus; a token generator (352-2), which generates initial clockwise tokens and initial counter-clockwise tokens for a designated PE in the ring-bus; and a result capture module (352-3), which captures results returned by the PEs in the ring-bus.

FIG. 3C illustrates an exemplary processing element (PE) (355), which is coupled to the bi-directional ring-bus architecture of FIG. 3A to perform matrix inversion of input digital signal streams from multiple antennas, according to an embodiment of the disclosure. The PE (355) includes at least a first latch (355-1), a second latch (355-2), a single CORDIC core (355-3), a register (355-4) and a control unit (355-5). The bi-directional ring-bus architecture includes a clockwise ring (360) and a counter-clockwise ring (370). The first latch (355-1) is coupled to the clockwise ring (360) at a first dock (362), and the second latch (355-2) is coupled to the counter-clockwise ring (370) at a second dock (372).

The clockwise ring (360) may facilitate transmission or movement of signals or information in a horizontal direction (West to East), and the counter-clockwise ring (370) may facilitate transmission or movement of signals or information in a vertical direction (North to South), as in a systolic array (150) as shown in FIG. 1A. Each ring (clockwise ring (360) or counter-clockwise ring (370)) includes three sub-rings, namely, a Request ring (360-A), a Data ring (360-B) and an Acknowledgment ring (360-C).

The signals or information which pass through the clockwise ring (360) and the counter-clockwise ring (370) may be designated as a “token” (381), or more specifically, as a clockwise token or a counter-clockwise token, respectively. The first dock (362) and the second dock (372) of the PE (355) are responsible for initiating dispatch of, and accepting, a clockwise token and a counter-clockwise token, respectively. The first dock (362) and the second dock (372) may either be in an active mode (ready to accept a clockwise token and a counter-clockwise token) or in an inactive mode (the PE (355) is busy processing the clockwise token and the counter-clockwise token).

The PE (355) in the ring-bus may be driven or triggered by two tokens, namely, the clockwise token and the counter-clockwise token. The PE (355) may also issue a new clockwise token and a new counter-clockwise token for another PE in the ring-bus, upon completion of processing the pending clockwise token and the counter-clockwise token. Each token (381) includes information, namely, a PE Address (381-1 a) with a sequence stamp (SEQ) (381-1 b), Data (381-2) and an Acknowledgment (ACK) (381-3).

The PE Address (381-1 a) and the sequence stamp (SEQ) (381-1 b) are both carried on the Request ring (360-A). The Data (381-2) is carried on the Data ring (360-B), and the Acknowledgment (ACK) (381-3) signal is carried on the Acknowledgment ring (360-C), respectively. The PE Address (381-1 a) on the token (381) corresponds to an address or location of the PE (355) on the Request ring (360-A).
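The token layout described above can be pictured as a small record whose fields map onto the three sub-rings. The field types and the names `Token` and `split_by_sub_ring` are assumptions for illustration; the source does not specify bit widths or encodings.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Token:
    pe_address: int          # PE Address (381-1 a), carried on the Request ring
    seq: int                 # sequence stamp SEQ (381-1 b), Request ring
    data: Tuple[float, ...]  # Data (381-2), carried on the Data ring
    ack: bool = False        # ACK (381-3), carried on the Acknowledgment ring

def split_by_sub_ring(token):
    """Group the token fields by the sub-ring that carries them."""
    return {
        "request": (token.pe_address, token.seq),
        "data": token.data,
        "acknowledgment": token.ack,
    }

clockwise = Token(pe_address=6, seq=0, data=(0.5, -1.25))
print(split_by_sub_ring(clockwise))
```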

The SEQ (381-1 b) carries a sequence time stamp, which is used to resolve token order information (i.e., when the clockwise token or the counter-clockwise token was generated). The PE (355) on the bi-directional ring-bus can only be triggered if a clockwise token and a counter-clockwise token having matching PE Addresses (381-1 a) and a matching SEQ (381-1 b) are captured by the first dock (362) and by the second dock (372), respectively.
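The trigger condition stated above can be written as a predicate over the two captured tokens. This is a minimal sketch; the tuple layout and the function name `pe_triggered` are hypothetical, and a dock that has captured nothing is modeled as `None`.

```python
from collections import namedtuple

Token = namedtuple("Token", "pe_address seq data")

def pe_triggered(own_address, cw_token, ccw_token):
    """True only if both docks hold tokens for this PE with the same SEQ."""
    if cw_token is None or ccw_token is None:
        return False  # one dock is still waiting
    return (cw_token.pe_address == own_address
            and ccw_token.pe_address == own_address
            and cw_token.seq == ccw_token.seq)

print(pe_triggered(6, Token(6, 0, "west"), Token(6, 0, "north")))
print(pe_triggered(6, Token(6, 0, "west"), Token(6, 1, "north")))
```

The first call reports a trigger because address and SEQ agree on both rings; the second does not, because the two tokens belong to different sequence steps.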

The SEQ (381-1 b) may facilitate asynchronous token processing by the PEs on the bi-directional ring-bus. For example, suppose that PE5 on the bi-directional ring-bus processes tokens (in both directions) faster than PE6. PE5 may have already generated and issued more than one token, say token 0 and token 1 (in both directions), to PE6. Since PE6 may take a longer time to process its pending token (in both directions), the first and the second docks of PE6 may be in an “inactive mode” (which indicates that PE6 is busy processing pending tokens). PE6 may, therefore, capture neither token 0 nor token 1 (in either or both directions). Accordingly, both token 0 and token 1 would stay on the Request ring (360-A) (in both directions), until PE6 becomes available or free.

The first and second docks (362, 372) on PE6 may scan all passing clockwise tokens and counter-clockwise tokens on the clockwise ring (360) and the counter-clockwise ring (370), respectively. When the clockwise token and the counter-clockwise token destined for PE6 arrive, the first and second docks (362, 372) of PE6 first check the sequence stamp SEQ (381-1 b). If the SEQ (381-1 b) on both the clockwise token and the counter-clockwise token match, the first latch (355-1) and the second latch (355-2) of PE6 are triggered.

However, there is no guarantee that token 0 (with an earlier sequence stamp SEQ (381-1 b)) would be captured before token 1 (in both directions) by PE6. In this regard, the sequence stamp SEQ (381-1 b) may help PE6 decide whether to accept token 1 before token 0. If PE6 is configured to process token 0 before token 1 from PE5, the first and second docks (362, 372) on PE6 may scan all passing clockwise tokens and counter-clockwise tokens for token 0 from PE5.
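The PE5/PE6 example can be simulated to show how the sequence stamp restores order. The sketch below models a single ring direction as a queue of circulating tokens, which is a simplification; the helper name `capture_in_order` and the token contents are hypothetical.

```python
from collections import deque, namedtuple

Token = namedtuple("Token", "pe_address seq data")

def capture_in_order(ring, own_address, expected_seq):
    """Scan one full rotation and capture the token with the expected SEQ.

    Tokens for other PEs, or with a later SEQ, are put back on the ring.
    """
    for _ in range(len(ring)):
        token = ring.popleft()
        if token.pe_address == own_address and token.seq == expected_seq:
            return token
        ring.append(token)  # keeps circulating
    return None

# Token 1 happens to pass PE6 before token 0; a token for PE7 is also present.
ring = deque([Token(6, 1, "token 1"), Token(7, 0, "other"), Token(6, 0, "token 0")])

order = []
for seq in (0, 1):
    order.append(capture_in_order(ring, 6, seq).data)
print(order)
```

Although token 1 reaches the dock first, it is left on the ring until token 0 has been captured, so PE6 processes the two in sequence order and the token for PE7 is undisturbed.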

Meanwhile, the other PEs (e.g., PE5) in the ring-bus may independently continue to process other new tokens in the bi-directional ring-bus according to the matched PE Address (381-1 a) and the matched sequence stamp SEQ (381-1 b) (in both directions), irrespective of whether PE6 or other PEs in the bi-directional ring-bus have started to process their tokens or are in waiting mode. In this regard, the sequence stamp SEQ (381-1 b) may help facilitate asynchronous token processing by the PEs on the bi-directional ring-bus. The token processing by the PEs on the bi-directional ring-bus, therefore, follows the sequence time stamp SEQ (381-1 b). In yet another embodiment of the disclosure, the PE may be configured to process tokens in a first-in-first-out (FIFO) fashion or in a round robin queue fashion for other processing schemes.

Configuration codes may be pre-loaded to the control unit (355-5) to enable the PE (355) to carry out a QR decomposition (QRD) type, a back-substitution (BS) type or a back-substitution delay (BSD) type of operation via the single CORDIC core (355-3) on the PE (355).

The embedded Data (381-2) associated with the PE Address (381-1 a) and the sequence stamp SEQ (381-1 b) in the token (381) are transmitted in the Data ring (360-B) (in both directions). More specifically, the embedded Data (381-2) may include at least data of the digital signal streams (332 a to 332 d), which are to be loaded into the plurality of registers (355-4) of the PE. A QRD type, BS type or BSD type of operation may be carried out by the single CORDIC core (355-3) using the data stored in the plurality of registers (355-4), according to the pre-loaded codes of the control unit (355-5).

The Acknowledgment ACK (381-3) signal is generated by the PE (355) once the PE (355) has completed its operation on the embedded Data (381-2) on the tokens (in both directions). The ACK (381-3) is sent to the I/O controller (352) on the Acknowledgment ring (360-C) (in both directions) to signal to the I/O controller (352) that the processing of the pending tokens (in both directions) has been completed. The I/O controller (352) monitors the ACK (381-3) signals from the PEs on the bi-directional ring-bus.

Once the I/O controller (352) receives one or more ACK (381-3) signals from one or more designated PEs (355), the I/O controller (352) may input one or more new signals and initiate corresponding new tokens (in both directions) into the bi-directional ring-bus to the one or more designated PEs (355). In this way, the ACK (381-3) signal acts as a control signal to keep the I/O controller (352) from initiating and sending too many, or initiating and sending too few, tokens to the bi-directional ring-bus.
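One way to picture the ACK acting as a control signal is a window of outstanding tokens that is refilled as acknowledgments arrive. The fixed window size is an assumption made for this sketch and is not stated in the source; the class name `IOControllerModel` and its methods are likewise hypothetical.

```python
from collections import deque

class IOControllerModel:
    """Toy model: issue new tokens only while a slot is free."""

    def __init__(self, inputs, window):
        self.inputs = deque(inputs)  # data waiting to be turned into tokens
        self.window = window         # assumed cap on un-acknowledged tokens
        self.in_flight = 0
        self.issued = []

    def issue(self):
        """Issue tokens until the window is full or the inputs run out."""
        while self.inputs and self.in_flight < self.window:
            self.issued.append(self.inputs.popleft())
            self.in_flight += 1

    def on_ack(self):
        """A PE reported completion: free one slot and refill."""
        self.in_flight -= 1
        self.issue()

ctrl = IOControllerModel(["a", "b", "c"], window=2)
ctrl.issue()
print(ctrl.issued)   # only two tokens fit in the window
ctrl.on_ack()
print(ctrl.issued)   # the ACK released a slot for the third token
```

Too many tokens are avoided because nothing is issued while the window is full, and too few are avoided because every ACK immediately triggers a refill.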

FIG. 3D illustrates an implementation of the bi-directional ring-bus architecture of FIG. 3A that performs matrix inversion of input digital signal streams from multiple antennas, without requiring the column data of the input matrix to be deliberately skewed prior to processing, according to an embodiment of the disclosure. In a sense, the bi-directional ring-bus architecture of FIG. 3A may be realized as a “ring-bus” systolic array (350A) to support the 4×4 matrix inversion operations of a traditional systolic array (150) as shown in FIG. 1B.

As described with reference to FIG. 3C, digital signal streams (332 a to 332 d) (as tokens) are horizontally processed (West to East) by the clockwise ring (360), and vertically processed (North to South) by the counter-clockwise ring (370). In addition, the token processing by the PEs on the bi-directional ring-bus is asynchronous, independent of the start time and completion time of the previous PE or any other PEs in the bi-directional ring-bus. Simply put, an individual ring-bus PE does not need to wait for the longest processing time among all the other PEs before issuing or transmitting its results (as new tokens) to the ring-bus. In this regard, the “starting zeros” and “ending zeros” are no longer necessary in the bi-directional ring-bus architecture. Accordingly, the input matrixes (306, 308) are not skewed, as shown in FIG. 3D. Likewise, the output inversed matrixes (312, 316) are also not skewed. Therefore, compared to the traditional systolic array (150) (see FIG. 1B), the bi-directional ring-bus architecture achieves lower latency by removing the processing time of the “starting zeros” and “ending zeros” in the input matrixes and by eliminating the wait for the longest operation to complete.

In addition, FIG. 3C also illustrates that only a single CORDIC core (355-3) is used in each PE (355). Therefore, a 4×4 matrix inversion in the bi-directional ring-bus architecture may be accomplished by using only 16 CORDIC cores, compared to the 64 CORDIC cores required in the traditional systolic array (150). In this regard, the reduction of CORDIC cores further reduces latency by virtue of fewer CORDIC core iterations within the PE, and thus substantially reduces the power consumption for the matrix inversion operations. Furthermore, the token rotation frequency within the bi-directional ring-bus may be increased to increase CORDIC iterations or processing, which may improve precision, or further reduce latency on matrix inversion operations.
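The link between the number of CORDIC iterations and precision can be seen in a generic vectoring-mode CORDIC, the mode commonly used to compute the rotation angle for a QRD step. The sketch below is a textbook floating-point model, not the CORDIC core (355-3) of the disclosure: a hardware core would use fixed-point shifts and adds with a precomputed gain constant, and the function name `cordic_vectoring` is hypothetical. It assumes x > 0.

```python
import math

def cordic_vectoring(x, y, iterations):
    """Rotate (x, y) onto the x-axis; return (magnitude, angle in radians)."""
    angle = 0.0
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i                 # a right shift by i in hardware
        d = -1.0 if y > 0 else 1.0       # always rotate toward y = 0
        x, y = x - d * y * step, y + d * x * step
        angle -= d * math.atan(step)     # table lookup in hardware
        gain *= math.sqrt(1.0 + step * step)
    return x / gain, angle               # undo the accumulated CORDIC gain

exact = math.atan2(4.0, 3.0)
for n in (4, 8, 16):
    magnitude, angle = cordic_vectoring(3.0, 4.0, n)
    print(n, magnitude, abs(angle - exact))
```

Each additional iteration roughly halves the worst-case residual angle, so running more iterations trades time for precision, which is the trade-off the paragraph above describes.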

FIG. 3E illustrates using a multiple bi-directional ring-bus configuration to perform matrix inversion of input digital signal streams from an increased number of antennas, according to an embodiment of the disclosure. FIG. 3E illustrates that the 8×8 matrix inversion may alternatively be performed by cascading four 16-PE ring-buses (350-1 to 350-4), where each of the ring-buses (350-1 to 350-4) is configured as two matrix groups (i.e., matrix 1-2 and matrix 3-4), and where each of the ring-buses (350-1 to 350-4) is configured to process two columns (i.e., a sub-layer) of the 8×8 matrix input. Data of the input digital signal streams (332 a to 332 d) may enter the I/O controller of ring-bus (350-1) (as tokens), and exit through the same I/O controller after the first sub-layer is processed. The processed data (as tokens) may be transmitted to the ring-buses (350-2 to 350-4) to complete the remaining six matrix inversions. The benefit of a cascading configuration is improved speed, especially when a matrix of large dimension needs to be inversed. An optional cache memory (390A) may be used between ring-buses as a data buffer.

FIG. 3F illustrates using the same single bi-directional ring-bus architecture of FIG. 3A to perform matrix inversion of input digital signal streams from an increased number of antennas, according to an embodiment of the disclosure. The same 16-PE ring-bus (350B) may be configured to invert one 8×8 matrix by iteratively processing the four sub-layers (two columns each) of the 8×8 matrix input. A cache can be used to buffer the outputs of each sub-layer. Data of the input digital signal streams (332 a to 332 d) may enter the I/O controller of ring-bus (350-B) (as tokens) to complete the first sub-matrix processing. The tokens are then recirculated through the same I/O controller to perform the next three iterations and complete the 8×8 matrix inversion.
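The two ways of handling an 8×8 input, cascading four ring-buses or reusing a single one, can be contrasted with a small scheduling sketch. The mapping of sub-layer k to ring k is an assumption made for illustration, and the names `sub_layers` and `schedule` are hypothetical.

```python
def sub_layers(n_cols, cols_per_layer=2):
    """Split column indices into sub-layers of cols_per_layer columns."""
    return [list(range(start, min(start + cols_per_layer, n_cols)))
            for start in range(0, n_cols, cols_per_layer)]

def schedule(n_cols, n_rings, cols_per_layer=2):
    """Assign each sub-layer to (ring index, pass number, columns).

    With n_rings == 4 every sub-layer gets its own ring (cascaded case);
    with n_rings == 1 the same ring is reused for four passes (iterative
    case). The round-robin assignment is an assumption.
    """
    layers = sub_layers(n_cols, cols_per_layer)
    return [(k % n_rings, k // n_rings, cols) for k, cols in enumerate(layers)]

print("cascaded :", schedule(8, n_rings=4))
print("iterative:", schedule(8, n_rings=1))
```

In the cascaded case all four sub-layers are on pass 0 of different rings, while in the iterative case the single ring runs passes 0 to 3, with a cache holding the intermediate outputs between passes.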

It should be pointed out that the above bi-directional ring-bus architecture with CORDIC cores may be implemented as an integrated circuit (IC) chip individually, or in combination with other processing blocks such as DSPs, ALUs, controllers or microprocessors to expand hardware/software processing capabilities or applications. The bi-directional ring-bus architecture with CORDIC cores may be implemented in mobile and network devices, such as base stations, network routers, network switches, mobile handsets, wireless tablets, game consoles, and video graphics interfaces, to name a few.

The following summarizes some of the advantages of using a bi-directional ring-bus architecture over a traditional systolic array (150):

-   (1) Latency reduction in matrix inversion by virtue of eliminating “starting and ending zeros” and using fewer CORDIC cores.
-   (2) Configurable precision. Precision is proportional to the number of processing iterations performed by the CORDIC cores. Processing iterations may be increased by increasing the token rotation frequency in the ring-bus.
-   (3) Lower power consumption by virtue of using a single CORDIC core in the PE. In a traditional systolic array, one node may have three or four CORDIC cores, whereas in the ring-bus architecture only a single CORDIC core is needed per node.
-   (4) Reduction of chip area by virtue of a less complex design using fewer CORDIC cores.
-   (5) Flexibility to adapt to any antenna dimensions (e.g., 1×1, 2×2, 3×3, 4×4, 8×8 or larger) in the system by configuring the PEs in the bi-directional ring-bus architecture to any matrix size to perform parallel or serial inverse matrix processing, and by performing multiple iterations on the same bi-directional ring-bus architecture with space reduction.
-   (6) Scalability by cascading multiple ring-bus architectures for speed and to handle increased system complexity.
-   (7) The ring-bus architecture is a good candidate for asynchronous (self-timed) hardware design, which is more flexible. In this regard, asynchronous processing does not have to wait for a previous step to complete before starting the processing, unlike the traditional systolic array where the processing is contingent on the result of the previous step.
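The self-timed behaviour in item (7) can be pictured as a PE that fires as soon as it holds one token from each ring, without reference to any global clock or to the state of other PEs. The class below is a simplified software model under that assumption; the name `ProcessingElement`, its `receive` method and the `"cw"`/`"ccw"` labels are hypothetical and do not describe the actual latch or control circuitry.

```python
class ProcessingElement:
    """Toy model of a token-triggered PE (illustrative assumption only)."""

    def __init__(self, operation):
        self.operation = operation   # stands in for the CORDIC processing
        self.cw_token = None         # token from the clockwise ring
        self.ccw_token = None        # token from the counter-clockwise ring

    def receive(self, ring, value):
        """Store a token; fire and return a result once both are present."""
        if ring == "cw":
            self.cw_token = value
        elif ring == "ccw":
            self.ccw_token = value
        else:
            raise ValueError("ring must be 'cw' or 'ccw'")
        if self.cw_token is None or self.ccw_token is None:
            return None              # still waiting for the other token
        result = self.operation(self.cw_token, self.ccw_token)
        self.cw_token = self.ccw_token = None   # ready for the next pair
        return result

pe = ProcessingElement(lambda a, b: a + b)
print(pe.receive("cw", 2))    # no result yet, only one token held
print(pe.receive("ccw", 3))   # both tokens present, PE fires
```

The PE's only firing condition is local (its own two tokens), which is the property that removes the need to pad inputs or wait for the slowest node.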

Those of ordinary skill in the art should understand that all or a part of the steps in the method according to the embodiments of the present disclosure can be implemented by a program instructing relevant hardware, and the program may be stored in a non-transitory computer readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and executed in a machine, such as an end-user mobile device, a server, or cloud computing infrastructure.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present disclosure without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the present disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.

What is claimed is:
 1. A method for processing multiple antenna signals, comprising: receiving input signals from multiple antennas; converting the received input signals into a plurality of input digital signal streams; and performing QR decomposition (QRD) based matrix inversion on the plurality of input digital signal streams via a bi-directional ring bus processing architecture that comprises: a bi-directional ring bus, with a first ring bus having signal flow in a clockwise direction, a second ring bus having signal flow in a counter-clockwise direction; an input/output (I/O) controller coupled to the first and the second ring bus, respectively; a plurality of processing elements (PEs), each of the plurality of PEs is coupled to the first and the second ring bus, respectively, wherein each of the plurality of PEs comprises at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.
 2. The method according to claim 1, wherein each of the plurality of PEs comprises a control unit, at least one register, an input latch and an output latch, wherein the input latch and the output latch are each coupled via a corresponding control dock of the first ring bus and a corresponding control dock of the second ring bus.
 3. The method according to claim 2, wherein each of the plurality of PEs is triggered to process the plurality of input digital signal streams when both the input latch and the output latch are each triggered by a respective first and second token received from the first and the second ring bus.
 4. The method according to claim 2, further comprising pre-loading configuration codes to the control unit in each of the plurality of PEs to enable each of the plurality of PEs to perform at least one of: a QR decomposition, a back-substitution (BS), and a back-substitution delay (BSD) type processing.
 5. The method according to claim 1, wherein the bi-directional ring bus processing architecture operates in an asynchronous mode, wherein each of the plurality of PEs starts and ends processing of the plurality of input digital signal streams independently, without having to synchronize the processing with another PE in the bi-directional ring bus processing architecture at any time instant.
 6. The method according to claim 1, wherein each of the plurality of PEs processes the plurality of input digital signal streams without having to insert starting zeros or ending zeros to any one of the plurality of input digital signal streams.
 7. The method according to claim 1, wherein the I/O controller performs functions comprising one or more of: inputting the plurality of input digital signal streams, outputting results from the plurality of PEs in the bi-directional ring bus architecture, monitoring acknowledges and generating initial tokens for one or more of the plurality of PEs.
 8. The method according to claim 1, further comprising cascading a plurality of the bi-directional ring bus processing architecture to increase inverse matrix processing capacity, when higher number of multiple antennas are used.
 9. The method according to claim 1, further comprising integrating the bi-directional ring bus processing architecture into a digital signal processing (DSP) core to expand hardware/software processing functions and to reduce chip size.
 10. The method according to claim 9, wherein the DSP core comprises an execution core and a matrix computation block, wherein the execution core executes op codes and only a single op code to trigger the matrix computation block.
 11. An apparatus for processing multiple antenna signals comprises: a receiver which receives input signals from multiple antennas; one or more processing circuits which processes the received input signals into a plurality of input digital signal streams; a bi-directional ring bus processing architecture which performs QR decomposition (QRD) based matrix inversion on the plurality of input digital signal streams, wherein the bi-directional ring bus processing architecture comprises: a bi-directional ring bus, with a first ring bus having signals flow in a clockwise direction, a second ring bus having signals flow in a counter-clockwise direction; an input/output (I/O) controller coupled to the first and the second ring bus, respectively; a plurality of processing elements (PE), each of the plurality of PEs is coupled to the first and the second ring bus, respectively, wherein each of the plurality of PEs comprises at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.
 12. The apparatus according to claim 11, wherein each of the plurality of PEs comprises a control unit, at least one register, an input latch and an output latch, wherein the input latch and the output latch are each coupled via a corresponding control dock of the first ring bus and a corresponding control dock of the second ring bus.
 13. The apparatus according to claim 12, wherein each of the plurality of PEs is triggered to process the plurality of input digital signal streams when both the input latch and the output latch are each triggered by a respective first and second token received from the first and the second ring bus.
 14. The apparatus according to claim 12, wherein the control unit in each of the plurality of PEs comprises pre-loaded configuration codes, wherein each of the plurality of PEs has been enabled to perform at least one of: a QR decomposition, a back-substitution (BS) and a back-substitution delay (BSD) type processing.
 15. The apparatus according to claim 11, wherein the bi-directional ring bus processing architecture operates in an asynchronous mode, wherein each of the plurality of PEs starts and ends processing of the plurality of input digital signal streams independently, without having to synchronize the processing with another PE in the bi-directional ring bus processing architecture at any time instant.
 16. The apparatus according to claim 11, wherein each of the plurality of PEs processes the plurality of input digital signal streams without having to insert starting zeros or ending zeros to any one of the plurality of input digital signal streams.
 17. The apparatus according to claim 11, wherein the apparatus comprises a network device selected from one of: a base station, a wireless handset, a wireless tablet device, a game console, a network router or a network switch, and wherein the I/O controller is enabled to perform one or more of: input the plurality of input digital signal streams, output results from the plurality of PEs in the bi-directional ring bus architecture, monitor acknowledges and generate initial tokens for one or more of the plurality of PEs.
 18. The apparatus according to claim 11, wherein a plurality of the bi-directional ring bus processing architecture are cascaded to increase the inverse matrix processing capacity, when higher number of multiple antennas are used.
 19. The apparatus according to claim 11, wherein the bi-directional ring bus processing architecture has been integrated with a digital signal processing (DSP) core as a single chip to expand hardware/software processing functions and to reduce chip size, wherein the DSP core comprises an execution core and a matrix computation block, wherein the execution core executes op codes and only a single op code to trigger the matrix computation block.
 20. A matrix computation unit comprises: a bi-directional ring bus processing architecture which performs QR decomposition (QRD) based matrix inversion on a plurality of input digital signal streams, wherein the bi-directional ring bus processing architecture comprises: a bi-directional ring bus, with a first ring bus having signals flow in a clockwise direction, a second ring bus having signals flow in a counter-clockwise direction; an input/output (I/O) controller coupled to the first and the second ring bus, respectively; a plurality of processing elements (PE), each of the plurality of PEs is coupled to the first and the second ring bus, respectively, wherein each of the plurality of PEs comprises at least one Coordinate Rotation Digital Computer (CORDIC) core for performing CORDIC iterations on the plurality of input digital signal streams to produce inversed matrix signals.